library(lessR)
library(ggplot2)
data <- read.csv(file="NutritionStudy.csv",head=TRUE,sep=",", stringsAsFactors = TRUE)
The Equation for this model is \(Y=193.701+3.813X_{1}\).
The interpretation is that if someone doesn’t eat fiber, their cholestorol will be 193.701.
For every unit of fiber they eat, their cholestorol will increase by 3.813.
The \(R^2\) is .02, which means it only explains 2.0% of the variance. This indicates that the fiber variable is not enough to predict cholestorol. The variable is significant though, at a 5% cutoff, so we reject the null hyptohesis that the slope is 0. So any relationship we establish between these 2 variables can be real (and not due to chance).
This may be a variable I would keep throughout the analysis
The F test:
\(H_{0}: \beta_{1} = 0\)
\(H_{a}: \beta_{i} \neq 0\)
Looking at the ANOVA and the F statistic, we can reject the null hypotheis that \(\beta = 0\).
While the \(R^2\) is low because it doesnt explain all the variance, the hypothesis tests is telling us that the variable does have predictive power and that the slope not equal to 0 is not due to chance.
Checking assumptions:
Linearity - we do see more or less a linear pattern, until the < 30 fiber counts. But we dont see any non-linear patterns like a polynomial, exponential, etc etc.
Homoscedasticity - The residuals vs fitted graph dont appear to be completly constant. Towards the higher end of the fitted values, we dont see the same distribution around the x=0 as the other values.
residual normality - the QQ plot does appear to show that the residuals are not perfectly normally distributed towards the top ends. multicollinearity - Since we are only using 1 predictor, this does not apply.
The scale location plot tells the same story as the residual plot, but something interesting is that looking at the red line, the magnitude of the residuals are fairly flat across all X’s.
levearage: We dont see any outliers as all the points are within the cooks distance line.
model1 <- lm(Cholesterol~Fiber, data=data)
summary(model1)
##
## Call:
## lm(formula = Cholesterol ~ Fiber, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -216.48 -88.58 -34.54 61.18 652.10
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 193.701 19.157 10.111 < 0.0000000000000002
## Fiber 3.813 1.383 2.757 0.00618
##
## Residual standard error: 130.6 on 313 degrees of freedom
## Multiple R-squared: 0.02371, Adjusted R-squared: 0.02059
## F-statistic: 7.6 on 1 and 313 DF, p-value: 0.006179
plot(model1)
anova(model1)
## Analysis of Variance Table
##
## Response: Cholesterol
## Df Sum Sq Mean Sq F value Pr(>F)
## Fiber 1 129684 129684 7.6002 0.006179
## Residuals 313 5340757 17063
The Equation for this model is \(Y=189.266+3.813X_{1}-2.523X_{2}+44.429X_{3}\).
The interpretation is as follows:
when all betas are equal to 0 (no influence in any of the variables or if the value is -), the cholestorol is 189.266.
Fiber Variable:
For alcohol = 0 as fiber increases 1 unit, cholestorol increases 3.813 units.
Alcohol Variable:
using 0 alcohol consumption as the basis for comparison, for someone who drinks >0 alcohol and <10 units, their total cholestorol would be the same as someone who drinks 0 alcohol, but you would then decrease by -2.523.
using 0 alcohol consumption as the basis for comparison, for someone who drink >10 units, their total cholestorol would be the same as someone who drinks 0 alcohol, but you would then increasee it by +44.429
The adjusted \(R^2\) is .02363, which means it only explains 2.4% of the variance. The p value for fiber is significant while the p value for alcohol is insignificant.
The F test:
\(H_{0}: \beta_{1} = \beta_{2} = \beta_{3}= 0\)
\(H_{a}:\) at least one \(\beta_{i} \neq 0\)
Looking at the ANOVA and also looking at the F statistic in the summary table, we see p-value at .015 and we do see signficant Pr(>F). With this, we can reject the null hypothesis that \(H_{0}\) that all \(\beta\) = 0.
While the \(R^2\) is low because it doesnt explain all the variance, the hypothesis tests is telling us that the variable does have predictive power and that the slope not equal to 0 is not due to chance.
Checking assumptions:
Homoscedasticity - The residuals vs fitted graph dont appear to be completly constant. Towards the higher end of the fitted values, we dont see the same distribution around the x=0 as the other values. (Same as the prior chart with only fiber as a variable)
residual normality - the QQ plot does appear to show that the residuals are not perfectly normally distributed towards the top ends.
multicollinearity - Plotting a simple scatter of fiber vs alcohol, we do not see obvious signs of multicollinearity.
The scale location plot tells the same story as the residual plot, but something interesting is that looking at the red line, the magnitude of the residuals are fairly flat across all X’s.
levearage: We dont see any outliers as all the points are within the cooks distance line.
data$Alcohol_d0 <- ifelse(data$Alcohol==0,1,0)
data$Alcohol_d1 <- ifelse(data$Alcohol<10 & data$Alcohol>0,1,0)
data$Alcohol_d2 <- ifelse(data$Alcohol>=10,1,0)
model2 <- lm(Cholesterol~Fiber + Alcohol_d1 + Alcohol_d2, data=data)
summary(model2)
##
## Call:
## lm(formula = Cholesterol ~ Fiber + Alcohol_d1 + Alcohol_d2, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -218.31 -91.83 -32.24 64.65 654.06
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 189.266 21.065 8.985 < 0.0000000000000002
## Fiber 3.984 1.389 2.868 0.00441
## Alcohol_d1 -2.523 15.836 -0.159 0.87352
## Alcohol_d2 44.429 28.429 1.563 0.11912
##
## Residual standard error: 130.4 on 311 degrees of freedom
## Multiple R-squared: 0.03296, Adjusted R-squared: 0.02363
## F-statistic: 3.533 on 3 and 311 DF, p-value: 0.01518
anova(model2)
## Analysis of Variance Table
##
## Response: Cholesterol
## Df Sum Sq Mean Sq F value Pr(>F)
## Fiber 1 129684 129684 7.6239 0.006102
## Alcohol_d1 1 9065 9065 0.5329 0.465927
## Alcohol_d2 1 41545 41545 2.4423 0.119117
## Residuals 311 5290147 17010
plot(model2)
plot(data$Fiber,data$Alcohol)
Plotting a scatter plot of fitted versus fiber and grouped by alcoholgroup, we see a straight line for each alcohol group. The lines are all parallel to each other (with the same slope).
Now, when we plot the actual versus fiber, and color code based on alcohol group, we see a much more complex and interwined scatter plot.
This is telling me that the fitted model did not fit the observed data very well. The fitted seemed to oversimplfy the relationship beteween fiber and cholestorol. A more complex model may be needed.
alcoholdata <-data
alcoholdata$fitted = model2$fitted.values
alcoholdata$alcoholgroup <- ifelse(alcoholdata$Alcohol_d0==1,1,ifelse(alcoholdata$Alcohol_d1==1,2,3))
ggplot(data=alcoholdata, aes(x=Fiber,
y=fitted,
color = factor(alcoholgroup)))+geom_point() + labs(title = "fiber vs cholestorol by group")
#+ geom_smooth(method=lm, aes(fill=factor(alcoholgroup)))
Plot(Fiber,Cholesterol, by = alcoholgroup, fit = TRUE, data=alcoholdata)
## >>> Suggestions
## Plot(Fiber, Cholesterol, out_cut=.10) # label top 10% potential outliers
## Plot(Fiber, Cholesterol, ellipse=0.95, add="means") # 0.95 ellipse with means
## Plot(Fiber, Cholesterol, enhance=TRUE) # many options, including the above
## Plot(Fiber, Cholesterol, shape="diamond") # change plot character
##
##
## >>> Pearson's product-moment correlation
##
## Number of paired values with neither missing, n = 315
##
##
## Sample Correlation of Fiber and Cholesterol: r = 0.154
##
##
## Alternative Hypothesis: True correlation is not equal to 0
## t-value: 2.757, df: 313, p-value: 0.006
##
## 95% Confidence Interval of Population Correlation
## Lower Bound: 0.044 Upper Bound: 0.260
The Equation for this model is \(Y=230.3434+.6363X_{1}-62.8481X_{2}-63.3814X_{3}+4.7976X_{5} + 9.0742X_{6}\).
The interpretation is as follows:
when all betas are equal to 0 (no influence in any of the variables), the cholestorol is 230.3434
Fiber Variable:
For alcohol = 0 as fiber increases 1 unit, cholestorol increases .6363 units.
Alcohol Variable:
using 0 alcohol consumption as the basis for comparison, for someone who drinks >0 alcohol and <10 units, their total cholestorol would be the same as someone who drinks 0 alcohol, but you would then decrease by -62.848
using 0 alcohol consumption as the basis for comparison, for someone who drink >10 units, their total cholestorol would be the same as someone who drinks 0 alcohol, but you would then decrease it by 63.3814
Interacton Variables: Looking at the interaction of <10 alcohol and fiber, the coefficient of 4.7976 tells us that the original slope of .6363 is now (4.7976 + .6363) = 5.4339.
Looking at the interaction of >10 alcohol and fiber, the coefficient of 9.0742 is telling us that the the original slope of .6363 is now (.6363 + 9.0742) = 9.7105.
This gives us the impression that higher alcohol volume combined with fiber, yields a higher rate of change in cholestorol. The adjusted \(R^2\) is .028, which means it only explains 2.8% of the variance. The p value for fiber is significant while the p value for alcohol and it’s interactions is insignificant.
The F test:
\(H_{0}: \beta_{1} = \beta_{2} = \beta_{3}= \beta_{4} = \beta_{5} =0\)
\(H_{a}:\) at least one \(\beta_{i} \neq 0\)
Looking at the ANOVA and the F statistic and also looking at the F statistic in the summary table, we see p-value at .015 and we do see some signficant Pr(>F). With this, we can reeject the null hypothesis that \(H_{0}\) that all \(\beta\) = 0.
Checking assumptions:
Homoscedasticity - The residuals vs fitted graph dont appear to be completly constant. Towards the higher end of the fitted values, we dont see the same distribution around the x=0 as the other values. (Same as the prior chart with only fiber as a variable)
residual normality - the QQ plot does appear to show that the residuals are not perfectly normally distributed towards the top ends. multicollinearity - Plotting a simple scatter of fiber vs alcohol, we do not see obvious signs of multicollinearity.
The scale location plot tells the same story as the residual plot, but something interesting is that looking at the red line, the magnitude of the residuals are fairly flat across all X’s.
levearage: We dont see any outliers as all the points are within the cooks distance line.
alcoholdata$fxd0 <- alcoholdata$Fiber*alcoholdata$Alcohol_d0
alcoholdata$fxd1 <- alcoholdata$Fiber*alcoholdata$Alcohol_d1
alcoholdata$fxd2 <- alcoholdata$Fiber*alcoholdata$Alcohol_d2
model3 <- lm(Cholesterol~Fiber + Alcohol_d1 + Alcohol_d2 + fxd1 + fxd2, data=alcoholdata)
alcoholdata$unequalslopefitted <- model3$fitted.values
summary(model3)
##
## Call:
## lm(formula = Cholesterol ~ Fiber + Alcohol_d1 + Alcohol_d2 +
## fxd1 + fxd2, data = alcoholdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -184.25 -88.39 -25.85 64.40 661.19
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 230.3434 31.5413 7.303 0.00000000000241
## Fiber 0.6363 2.3655 0.269 0.788
## Alcohol_d1 -62.8481 40.5528 -1.550 0.122
## Alcohol_d2 -63.3814 85.4549 -0.742 0.459
## fxd1 4.7976 2.9565 1.623 0.106
## fxd2 9.0742 6.8735 1.320 0.188
##
## Residual standard error: 130.1 on 309 degrees of freedom
## Multiple R-squared: 0.04366, Adjusted R-squared: 0.02819
## F-statistic: 2.821 on 5 and 309 DF, p-value: 0.01651
anova(model3)
## Analysis of Variance Table
##
## Response: Cholesterol
## Df Sum Sq Mean Sq F value Pr(>F)
## Fiber 1 129684 129684 7.6597 0.005987
## Alcohol_d1 1 9065 9065 0.5354 0.464888
## Alcohol_d2 1 41545 41545 2.4538 0.118265
## fxd1 1 29048 29048 1.7157 0.191221
## fxd2 1 29508 29508 1.7429 0.187754
## Residuals 309 5231592 16931
plot(model3)
ggplot(data=alcoholdata, aes(x=Fiber,
y=unequalslopefitted,
color = factor(alcoholgroup)))+geom_point() + labs(title = "fiber vs cholestorol with interaction")
The full model is the model with the interactions and the reduced model is the model without the interactions.
Nested F test:
\(H_{0}: \beta_{4} = \beta_{5}=0\)
\(H_{a}: \beta_4 \neq 0\) or \(\beta_5 \neq 0\)
The nested F test tells us that the betas for the interactions are not signficant, so we can’t reject the null hypthesis that the slope is 0. This means that we can’t definitively say that there is a relationship between the interactions to cholestorol. And since we arent rejecting the null hypothesis, we can’t say that the slopes are different.
anova(model2, model3)
## Analysis of Variance Table
##
## Model 1: Cholesterol ~ Fiber + Alcohol_d1 + Alcohol_d2
## Model 2: Cholesterol ~ Fiber + Alcohol_d1 + Alcohol_d2 + fxd1 + fxd2
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 311 5290147
## 2 309 5231592 2 58556 1.7293 0.1791
creating dummies for smoke, vitamins, and gender
data$Smoke_d0 <- ifelse(data$Smoke=="No",1,0)
data$Smoke_d1 <- ifelse(data$Smoke=="Yes",1,0)
data$VitaminUse_d0 <- ifelse(data$VitaminUse=="No",1,0)
data$VitaminUse_d1 <- ifelse(data$VitaminUse=="Occasional",1,0)
data$VitaminUse_d2 <- ifelse(data$VitaminUse=="Regular",1,0)
data$Gender_d0 <- ifelse(data$Gender=="Female",1,0)
data$Gender_d1 <- ifelse(data$Gender=="Male",1,0)
#Interactions
data$fxsmoked0 <- data$Fiber*data$Smoke_d0
data$fxsmoked1 <- data$Fiber*data$Smoke_d1
data$fxVitaminUsed0 <- data$Fiber*data$VitaminUse_d0
data$fxVitaminUsed1 <- data$Fiber*data$VitaminUse_d1
data$fxVitaminUsed2 <- data$Fiber*data$VitaminUse_d2
data$fxGenderd0 <- data$Fiber*data$Gender_d0
data$fxGenderd1 <- data$Fiber*data$Gender_d1
Smoke and Fiber.
smokedata <- data
smokedata$smokegroup <- ifelse(smokedata$Smoke_d0==1,1,2)
The Equation for this model is as follows:
Full Model: \(Y=179.184+4.455_{1}+63.059X_{2}-1.597X_{3}\).
Reduced Model: \(181.274 +4.296X_{1}+45.738X_2\).
Interpretation:
For the Full model, when all betas are 0, the cholestorol is 179.184.
When smoke = ‘No’, then cholestorol will be increased by 4.455 for every unit of fiber increase
When smoke = ‘Yes’, cholestorol will be the same as when smoke = ‘No’, but 63.059 will be added to the total cholestorol.
With the interaction of fiber and smoke, for smoke = ‘Yes’, instead of 4.455 for every unit increase, the unit increase will be (4.455-1.597) = 2.858.
The \(R^2\) for both models are low, (2.8 and 3.1 for full and reduced models), so they do not explain all the variance. The p-values are still insignficant except for fiber, so we dont reject the null hypothesis that the betas are equal to 0. This also means that the relationship we see between smoke and cholestorol may be due to chance.
smoke_full <-lm(Cholesterol~Fiber + Smoke_d1 + fxsmoked1, data=smokedata)
smokedata$fitted.full <- smoke_full$fitted.values
summary(smoke_full)
##
## Call:
## lm(formula = Cholesterol ~ Fiber + Smoke_d1 + fxsmoked1, data = smokedata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -218.86 -87.71 -35.15 65.11 657.36
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 179.184 20.875 8.583 0.000000000000000447
## Fiber 4.455 1.471 3.028 0.00267
## Smoke_d1 63.059 55.002 1.146 0.25248
## fxsmoked1 -1.597 4.661 -0.343 0.73218
##
## Residual standard error: 130.1 on 311 degrees of freedom
## Multiple R-squared: 0.03789, Adjusted R-squared: 0.02861
## F-statistic: 4.082 on 3 and 311 DF, p-value: 0.007277
smoke_reduced <-lm(Cholesterol~Fiber + Smoke_d1, data=smokedata)
smokedata$fitted.reduced <- smoke_reduced$fitted.values
summary(smoke_reduced)
##
## Call:
## lm(formula = Cholesterol ~ Fiber + Smoke_d1, data = smokedata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -216.77 -87.79 -35.81 65.54 657.56
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 181.274 19.936 9.093 < 0.0000000000000002
## Fiber 4.296 1.394 3.081 0.00224
## Smoke_d1 45.738 21.611 2.116 0.03510
##
## Residual standard error: 129.9 on 312 degrees of freedom
## Multiple R-squared: 0.03752, Adjusted R-squared: 0.03135
## F-statistic: 6.082 on 2 and 312 DF, p-value: 0.002563
F tests an Nested F test:
\(H_{0}: \beta_{1} = \beta_{2} = \beta_{3} =0\)
\(H_{a}:\) at least one \(\beta_{i} \neq 0\)
Our F test for both full and reduced model is telling us that again, fiber is significant, and smoke is significant at 3.5%
For the Nested F test:
\(H_{0}: \beta_{3}=0\)
\(H_{a}: \beta_3 \neq 0\)
We are not able to reject the null hypothesis with the F statistic.
Using all the above info, it appears that 1) Fiber is a significant variable, 2) The F tests is telling us we are able to reject the null hypothesis for smoke (excluding the interaction), and 3)the interaction term is also not significant, so we can potentially omit this interaction variable in the model. This is also telling us that the unequal slopes may not be real.
anova(smoke_full)
## Analysis of Variance Table
##
## Response: Cholesterol
## Df Sum Sq Mean Sq F value Pr(>F)
## Fiber 1 129684 129684 7.6630 0.005975
## Smoke_d1 1 75590 75590 4.4666 0.035360
## fxsmoked1 1 1986 1986 0.1173 0.732179
## Residuals 311 5263182 16923
anova(smoke_reduced)
## Analysis of Variance Table
##
## Response: Cholesterol
## Df Sum Sq Mean Sq F value Pr(>F)
## Fiber 1 129684 129684 7.6847 0.005904
## Smoke_d1 1 75590 75590 4.4793 0.035100
## Residuals 312 5265167 16876
anova(smoke_full, smoke_reduced)
## Analysis of Variance Table
##
## Model 1: Cholesterol ~ Fiber + Smoke_d1 + fxsmoked1
## Model 2: Cholesterol ~ Fiber + Smoke_d1
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 311 5263182
## 2 312 5265167 -1 -1985.6 0.1173 0.7322
Checking assumptions:
Homoscedasticity - The residuals do look random, but it’s not constant throughout all of the fitted values. towards the higher end of the fitted values, it is not longer spread around the 0 mark.
residual normality - the QQ plot does appear to show that the residuals are not perfectly normally distributed towards the top ends.
The scale location plot tells the same story as the residual plot: random residuals, but the standarized residuals seem to increase just a tad towards the higher ends of the fitted values
levearage: We dont see any outliers as all the points are within the cooks distance line.
In terms of goodness of it, I feel this is not a perfect dataset that satisifies all linear regression assumptions, but it’s not terrible. Any issues with the prediction result would be more caused by the variable selection than assumption violations.
plot(data$Fiber,data$Smoke)
ggplot(data=smokedata, aes(x=Fiber,
y=fitted.reduced,
color = factor(smokegroup)))+geom_point() + labs(title = "smoke and fiber")
ggplot(data=smokedata, aes(x=Fiber,
y=fitted.full,
color = factor(smokegroup)))+geom_point() + labs(title = "smoke and fiber with interactions")
plot(smoke_full)
plot(smoke_reduced)
fiber and vitamins
vitamindata <- data
vitamindata$vitamingroup <- ifelse(vitamindata$VitaminUse_d0==1,1,ifelse(vitamindata$VitaminUse_d1==1,2,3))
The Equation for this model is as follows:
Full Model: \(Y=208.821+3.111X_1 -19.453X_2 -29.942X_3 + 1.300X_4 + 1.1196X_5\).
Reduced Model: \(198.745 + 3.940X_1 -3.401X_2 - 14.947X_3\).
Interpretation:
For the Full model, when all betas are 0, the cholestorol is 208.821
When vitaminuse = ‘No’, then cholestorol will be increased by 3.111 for every unit of fiber increase.
When vitaminuse = ‘occasional’, cholestorol will be the same as when vitaminUse = ‘No’, but -19.453 will be added to the total cholestorol.
When vitaminuse = ‘regular’, cholestorol will be the same as when vitaminUse = ‘No’, but -29.942 will be added to the total cholestorol.
The Reduced model is interpreted similary where we just shift the fiber coefficient by the vitamin coefficient, minus the interaction components
Interaction terms With the interaction of fiber and VitaminUse, for vitaminUse = ‘Occasional’, instead of 3.111 for every unit increase, the unit increase will be (3.111+1.300) = 4.411
With the interaction of fiber and vitumanUse, for vitaminUse = ‘Regular’, instead of 3.111 for every unit increase, the unit increase will be (3.111+1.1196) = 4.307.
The is all assuming basis of interpretation for vitaminUse= “No”
The \(R^2\) for both models are low, (.01 and .016 for full and reduced models), so they do not explain all the variance. The p-values are still insignficant except for fiber, so we dont reject the null hypothesis that the betas are equal to 0. This also means that the relationship we see between vitamin and cholestorol may be due to chance.
vitamin_full <-lm(Cholesterol~Fiber + VitaminUse_d1 + VitaminUse_d2 + fxVitaminUsed1 + fxVitaminUsed2, data=vitamindata)
vitamindata$fitted.full <- vitamin_full$fitted.values
summary(vitamin_full)
##
## Call:
## lm(formula = Cholesterol ~ Fiber + VitaminUse_d1 + VitaminUse_d2 +
## fxVitaminUsed1 + fxVitaminUsed2, data = vitamindata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -214.64 -91.71 -33.55 63.36 659.80
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 208.821 32.308 6.463 0.000000000399
## Fiber 3.111 2.454 1.267 0.206
## VitaminUse_d1 -19.453 52.883 -0.368 0.713
## VitaminUse_d2 -29.942 43.947 -0.681 0.496
## fxVitaminUsed1 1.300 3.945 0.329 0.742
## fxVitaminUsed2 1.196 3.188 0.375 0.708
##
## Residual standard error: 131.3 on 309 degrees of freedom
## Multiple R-squared: 0.02681, Adjusted R-squared: 0.01106
## F-statistic: 1.702 on 5 and 309 DF, p-value: 0.1338
vitamin_reduced <-lm(Cholesterol~Fiber + VitaminUse_d1+VitaminUse_d2, data=vitamindata)
vitamindata$fitted.reduced <- vitamin_reduced$fitted.values
summary(vitamin_reduced)
##
## Call:
## lm(formula = Cholesterol ~ Fiber + VitaminUse_d1 + VitaminUse_d2,
## data = vitamindata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -209.93 -88.01 -35.04 62.02 660.16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 198.745 20.990 9.469 < 0.0000000000000002
## Fiber 3.940 1.393 2.828 0.00498
## VitaminUse_d1 -3.401 19.074 -0.178 0.85861
## VitaminUse_d2 -14.947 17.259 -0.866 0.38714
##
## Residual standard error: 130.9 on 311 degrees of freedom
## Multiple R-squared: 0.02627, Adjusted R-squared: 0.01688
## F-statistic: 2.797 on 3 and 311 DF, p-value: 0.04034
F tests an Nested F test:
\(H_{0}: \beta_{1} = \beta_{2} = \beta_{3} = \beta_4 = \beta_5 = 0\)
\(H_{a}:\) at least one \(\beta_{i} \neq 0\)
Our F test for both full and reduced model is telling us that again, fiber is significant, but the remaing variables are not.
For the Nested F test:
\(H_{0}: \beta_{3} = \beta_4=0\) \(H_{a}: \beta_i \neq 0\)
We are not able to reject the null hypothesis with the F statistic.
Using all the above info, it appears that 1) Fiber is a significant variable, 2) The F tests is telling us we are not able to reject the null hypothesis, so it’s possible that any relationship generated by this model we see to cholestorol is due to chance, and 3)the interaction term is also not significant, so we can potentially omit this variable in the model. This is also telling us that the unequal slopes may not be real.
anova(vitamin_full)
## Analysis of Variance Table
##
## Response: Cholesterol
## Df Sum Sq Mean Sq F value Pr(>F)
## Fiber 1 129684 129684 7.5270 0.006433
## VitaminUse_d1 1 1181 1181 0.0686 0.793631
## VitaminUse_d2 1 12846 12846 0.7456 0.388547
## fxVitaminUsed1 1 501 501 0.0291 0.864663
## fxVitaminUsed2 1 2425 2425 0.1407 0.707808
## Residuals 309 5323804 17229
anova(vitamin_reduced)
## Analysis of Variance Table
##
## Response: Cholesterol
## Df Sum Sq Mean Sq F value Pr(>F)
## Fiber 1 129684 129684 7.5716 0.006278
## VitaminUse_d1 1 1181 1181 0.0690 0.793034
## VitaminUse_d2 1 12846 12846 0.7500 0.387144
## Residuals 311 5326730 17128
anova(vitamin_full,vitamin_reduced)
## Analysis of Variance Table
##
## Model 1: Cholesterol ~ Fiber + VitaminUse_d1 + VitaminUse_d2 + fxVitaminUsed1 +
## fxVitaminUsed2
## Model 2: Cholesterol ~ Fiber + VitaminUse_d1 + VitaminUse_d2
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 309 5323804
## 2 311 5326730 -2 -2926.1 0.0849 0.9186
Checking assumptions:
Homoscedasticity - The residuals do look random, but it’s not constant throughout all of the fitted values. towards the higher end of the fitted values, it is not longer spread around the 0 mark.
residual normality - the QQ plot does appear to show that the residuals are not perfectly normally distributed towards the top ends. The scale location plot tells the same story as the residual plot: random residuals, but the standarized residuals seem to increase just a tad towards the higher ends of the fitted values levearage: We dont see any outliers as all the points are within the cooks distance line.
In terms of goodness of it, I feel this is not a perfect dataset that satisifies all linear regression assumptions, but it’s not terrible. Any issues with the prediction result would be more caused by the variable selection than assumption violations.
plot(vitamin_full)
ggplot(data=vitamindata, aes(x=Fiber,
y=fitted.reduced,
color = factor(vitamingroup)))+geom_point() + labs(title = "vitamins and fiber")
ggplot(data=vitamindata, aes(x=Fiber,
y=fitted.full,
color = factor(vitamingroup)))+geom_point() + labs(title = "vitamins and fiber with interactions")
Fiber and Gender
Genderdata <- data
Genderdata$Gendergroup <- ifelse(Genderdata$Gender_d0==0,1,2)
fiber and gender
The Equation for this model is as follows:
Full Model: \(Y=162.359+5.273X_1 +311.514X_2 -16.138X_3\).
Reduced Model: \(184.490 + 3.529X_1 +96.294X_2\).
Interpretation:
For the Full model, when all betas are 0, the cholestorol is 162.359
When Gender = ‘Female’, then cholestorol will be increased by 5.273 for every unit of fiber increase. When Gender = ‘Male’, cholestorol will be the same as when Gender = ‘Female’, but 311.514 will be added to the total cholestorol.
The Reduced model is interpreted similary where we just shift the fiber coefficient by the gender coefficient, minus the interaction components
Interaction terms With the interaction of fiber and gender, for gender = ‘Male’, instead of 5.273 for every unit increase, the unit increase will be (5.273-16.138)= -10.865
The is all assuming basis of interpretation for Gender = “Female”
The \(R^2\) for both models are low, (.11 and .07 for full and reduced models), so they do not explain all the variance. The p-values are still insignficant except for fiber, so we dont reject the null hypothesis that the betas are equal to 0. This also means that the relationship we see between vitamin and cholestorol may be due to chance.
Gender_full <-lm(Cholesterol~Fiber + Gender_d1 + fxGenderd1, data=Genderdata)
Genderdata$fitted.full <- Gender_full$fitted.values
summary(Gender_full)
##
## Call:
## lm(formula = Cholesterol ~ Fiber + Gender_d1 + fxGenderd1, data = Genderdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -299.55 -80.27 -25.28 53.23 662.41
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 162.359 19.188 8.462 0.00000000000000105
## Fiber 5.273 1.391 3.790 0.000181
## Gender_d1 311.514 60.083 5.185 0.00000039029294889
## fxGenderd1 -16.138 4.233 -3.812 0.000166
##
## Residual standard error: 124 on 311 degrees of freedom
## Multiple R-squared: 0.1261, Adjusted R-squared: 0.1177
## F-statistic: 14.96 on 3 and 311 DF, p-value: 0.000000004028
Gender_reduced <-lm(Cholesterol~Fiber + Gender_d1, data=Genderdata)
Genderdata$fitted.reduced <- Gender_reduced$fitted.values
summary(Gender_reduced)
##
## Call:
## lm(formula = Cholesterol ~ Fiber + Gender_d1, data = Genderdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -296.10 -85.19 -30.31 56.56 665.39
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 184.490 18.681 9.876 < 0.0000000000000002
## Fiber 3.529 1.342 2.629 0.00898
## Gender_d1 96.294 21.013 4.583 0.00000665
##
## Residual standard error: 126.6 on 312 degrees of freedom
## Multiple R-squared: 0.08527, Adjusted R-squared: 0.07941
## F-statistic: 14.54 on 2 and 312 DF, p-value: 0.0000009149
F tests an Nested F test:
\(H_{0}: \beta_{1} = \beta_{2} = \beta_{3} = 0\)
\(H_{a}:\)at least one \(\beta_{i} \neq 0\)
Our F test for both full and reduced model is telling us that again, fiber is significant gender is also significant
For the Nested F test:
\(H_{0}: \beta_{2} = \beta_3=0\)
\(H_{a}: \beta_2 \neq 0\) or \(\beta_3\)
We are able to reject the null hypothesis with the F statistic.
Using all the above info, it appears that 1) Fiber, gender, and the interaction terms is significant, 2) The F tests is telling us we are able to reject the null hypothesis, so we can say that gender is significant in this model, and also, the interaction is does truly yield unequal slopes.
anova(Gender_full, Gender_reduced)
## Analysis of Variance Table
##
## Model 1: Cholesterol ~ Fiber + Gender_d1 + fxGenderd1
## Model 2: Cholesterol ~ Fiber + Gender_d1
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 311 4780527
## 2 312 5003953 -1 -223427 14.535 0.0001659
Checking assumptions:
Homoscedasticity - The residuals do look random, but it’s not constant throughout all of the fitted values. towards the higher end of the fitted values, it is not longer spread around the 0 mark.
residual normality - the QQ plot does appear to show that the residuals are not perfectly normally distributed towards the top ends.
The scale location plot tells the same story as the residual plot: random residuals, but the standarized residuals seem to increase just a tad towards the higher ends of the fitted values
levearage: We dont see any outliers as all the points are within the cooks distance line.
In terms of goodness of it, I feel this is not a perfect dataset that satisifies all linear regression assumptions, but it’s not terrible. Any issues with the prediction result would be more caused by the variable selection than assumption violations.
plot(Gender_full)
ggplot(data=Genderdata, aes(x=Fiber,
y=fitted.reduced,
color = factor(Gendergroup)))+geom_point() + labs(title= "fiber vs gender")
ggplot(data=Genderdata, aes(x=Fiber,
y=fitted.full,
color = factor(Gendergroup)))+geom_point() + labs(title = "fiber vs gender with interactions")
Comparison of all variables
For alcohol: Without interactions, all alcohol groupings are insignificant. P value for F statistic is .015 \(R^2\) is 2.3.
With interactions, all alcohol groupings are insignificant. p value for F statistic is .016. \(R^2\) is 2.8.
Nested F test: we fail to reject the interaction variable.
For smoke
Without interactions, all variables are significant. \(R^2\) for smoke is .031. p value for F statistic is .0025.
With interactions, all smoke variables are not significant. \(R^2\) is .028. p value for F statistic is .007
Nested F test: we fail to reject the interaction variable.
for vitaminUse
Without interactions, all vitamin variables are not significant. \(R^2\) is 1.6%. P value for F statistic is .1338
With interactions, all vitamin variables are not significant. \(R^2\) is 1.6. P value for F statistic is .04
Nested F test: We fail to reject the interaction variable.
For Gender
Without interaction, all variables are significant. \(R^2\) is .079. P value for F statistic is very low.
With interaction, all variables are significant \(R^2\) is 11.8%. P value for the F statistic is extremely low.
Nested F test: the p value for the F statistic is very low.
Based on the 4 comparisons, I feel gender is the best predictor for fiber. It has the highest Correlation coefficient, so we know it explains the most variance. all the variables came out to be signficant, so we know that any relationship to fiber coming out of gender is probabilistically not due to chance. The F test which assesses the overall model fit is also significant, so we are reassured of the performance of the model. Finally, the nested F test tells us the interaction variable is a good predictor, so we can say that each gender group is different from each other.
I realized that for the prior assignments, I wasn’t methodlogical in assessing my models, so for this assignment and future ones for the rest of the term, I’ve decided to take more effort in addressing each piece of the assumptions. Because of this, I started to realize the “tediousness” of model validation.
One aspect of the assupmotions that I’m still not comfortable in is looking at the residual vs fitted plot. The first one of Fiber vs cholestorol made me ask myself a lot of questions: It looks random, but it’s not “constant” across all X’s. Does that disqualify “constant” variance? It’s not a perfect represenation of “constant” variance, but its also not a perfect represenation of something that violates constant variance (i,e “fanning” shape).
With the new tools we have addressing unequal slopes, I do see the benefit of using this knowledge, but I think the more tests we have, the more complexity and care we have to take in addressing each component. I feel like as we do more of these assignments, it may be good to have a “checklist” of things to look out for (i,e 1. address assumptions, 2. look at scatters, 3. look at interactions, 4. look at un equal slopes, 5. look at F tests etc etc).
Thinking about unequal slopes, I do think about how to use this in my workplace. The most difficult part is knowing which interactions to test (I do wonder, is there a scoring procedure to assess which interactions have the most difference in impact (before we finalize the selection)). Then, in terms of interpreting the unequal slopes, the visual is a good way to show how different the groups are.
Unfortunately, life got the best of me this week, so I dont have time to try other variables, but if given the time, I would probably try a couple things:
1. figure out some sort of loop to loop through a few more variables.
2. maybe try the interactions of 2 interactions?